Modular content parser: YouTube + Instagram + Reddit #1
Draft
codeby wants to merge 33 commits into
CLI tool that resolves search queries, channels, playlists, or video URLs to a list of videos, then fetches top-level comments (optionally with replies) via the YouTube Data API and transcripts via youtube-transcript-api. Writes per-video JSON + Markdown plus a summary CSV and index. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Browser-based form wraps the existing parser modules: queries, channels, playlists, and videos as separate tabs; sidebar holds API key and limits; runs stream live status into the page; results are downloadable as a single ZIP or as summary.csv. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Falls back to the YOUTUBE_API_KEY environment variable for local runs. Wraps st.secrets access in try/except so a missing secrets.toml does not crash the app locally. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
The Streamlit app UI is now in Russian end-to-end. Added Save / Delete buttons next to the API key field that write the key to ~/.youtube_parser_config.json (chmod 600). Loading order on startup: st.secrets → $YOUTUBE_API_KEY → saved file. .gitignore added to keep caches, virtualenvs, the secrets file, and parser output out of git. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
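A minimal sketch of that fallback chain, assuming a helper named load_api_key and an "api_key" field in the saved config file (both names illustrative):

    import json, os
    from pathlib import Path
    import streamlit as st

    CONFIG_PATH = Path.home() / ".youtube_parser_config.json"

    def load_api_key() -> str | None:
        try:
            key = st.secrets.get("YOUTUBE_API_KEY")   # raises when secrets.toml is missing
            if key:
                return key
        except Exception:
            pass
        key = os.environ.get("YOUTUBE_API_KEY")        # local runs
        if key:
            return key
        if CONFIG_PATH.exists():                       # key saved via the Save button (chmod 600)
            return json.loads(CONFIG_PATH.read_text()).get("api_key")
        return None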
The Save button now writes the key to both ~/.youtube_parser_config.json and .streamlit/secrets.toml so it is available globally and via st.secrets in the same Streamlit project. The TOML upsert preserves any other keys in the file and deletes the file if removing the key leaves it empty. Delete clears both locations. The status caption lists every place the key is saved. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
The 1.x release replaced the static YouTubeTranscriptApi.list_transcripts class method with an instance API (api.list / api.fetch). The old code silently failed for every video because the broad except returned None on the AttributeError, so the UI always reported "no transcript". Rewrote transcripts.py against the new API and switched to a verbose return shape so callers can distinguish disabled, missing, and blocked cases. Both the Streamlit app and the CLI now report the actual reason when no transcript is produced. Pinned the dependency to >=1.0.0. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
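A sketch of the 1.x-style call the rewrite targets; the status strings below are illustrative rather than the module's exact return shape, and the exception names are the library's as of 1.x:

    from youtube_transcript_api import (
        YouTubeTranscriptApi,
        TranscriptsDisabled,
        NoTranscriptFound,
    )

    def fetch_transcript(video_id: str) -> dict:
        api = YouTubeTranscriptApi()                 # 1.x: instance, not class-level methods
        try:
            fetched = api.fetch(video_id)
            return {"status": "ok", "segments": fetched.to_raw_data()}
        except TranscriptsDisabled:
            return {"status": "disabled", "segments": None}
        except NoTranscriptFound:
            return {"status": "missing", "segments": None}
        except Exception as exc:                     # e.g. RequestBlocked (see the next commit)
            return {"status": "blocked", "error": str(exc), "segments": None}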
YouTube blocks transcript requests from datacenter IPs (Streamlit Cloud, GCP, AWS), surfacing as RequestBlocked. Add a proxy_config kwarg to the transcripts module and a sidebar section in the Streamlit app to choose Webshare (rotating residential proxies) or a generic HTTP proxy. Defaults are pulled from st.secrets (WEBSHARE_USERNAME, WEBSHARE_PASSWORD, PROXY_HTTP_URL, PROXY_HTTPS_URL) or environment variables, so creds set in the Streamlit Cloud dashboard load automatically. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
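Roughly how the proxy_config kwarg is wired; the proxy classes below are youtube-transcript-api's 1.x proxy helpers as I understand them, so treat the names and parameters as assumptions to verify against the installed version:

    from youtube_transcript_api import YouTubeTranscriptApi
    from youtube_transcript_api.proxies import WebshareProxyConfig, GenericProxyConfig

    # Webshare rotating residential proxies (credentials from st.secrets or env)
    proxy = WebshareProxyConfig(proxy_username="user", proxy_password="pass")
    # ...or a generic HTTP proxy:
    # proxy = GenericProxyConfig(http_url="http://user:pass@host:port",
    #                            https_url="http://user:pass@host:port")

    api = YouTubeTranscriptApi(proxy_config=proxy)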
Adds a source-agnostic core (schema.py with Item/Comment/Transcript dataclasses, plugin.py with the SourcePlugin ABC plus InputSpec/FieldSpec, registry.py, runner.py, secrets.py, output.py, errors.py) so additional sources can plug in alongside YouTube without touching the core. The existing YouTube modules move into content_parser/plugins/youtube/ with an adapter that converts API dicts into the new Item schema and a YouTubePlugin implementing the contract. The youtube_parser/sources.py, comments.py, and transcripts.py become one-line shims that re-export from the new location, so existing callers (app.py, youtube_parser.main) keep working unchanged. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
content_parser.cli exposes 'list-sources' and 'run --source ... --input KIND=VALUE --set KEY=VALUE'. Convenience aliases (--query, --channel, --video, --hashtag, --account, --post) and key=value setting overrides make scripted runs ergonomic. content_parser/ui/app.py renders the Streamlit interface from each plugin's input_specs() and settings_specs(), so adding a new source needs no UI changes. Sidebar manages secrets per plugin (load from st.secrets/env/config.json, save/clear buttons), the proxy block shows only when the active plugin has a proxy_provider setting. Root app.py is now a 3-line shim into content_parser.ui.app.main, so Streamlit Cloud picks up the new UI on next deploy. The legacy youtube_parser.main CLI keeps working unchanged via the back-compat shims introduced in the previous commit. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
InstagramPlugin handles three input kinds — hashtags, accounts, and direct post/reel URLs — and runs them in a single Apify actor call. The adapter maps Apify post fields (likesCount, videoViewCount, musicInfo, latestComments with nested replies) into the unified Item schema, with audio_id and audio_title surfaced under media for trend research. ApifyClient is a thin wrapper around run-sync-get-dataset-items with explicit handling of 401 (bad token) and 402 (out of credits). The plugin auto-registers via content_parser.core.registry, so the CLI and Streamlit UI pick it up without further changes — confirmed via 'python -m content_parser.cli list-sources'. Adds requests>=2.31.0 to requirements. https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
- registry.py now distinguishes ImportError (optional dep missing — silent at DEBUG) from any other exception (typo, runtime bug — printed to stderr), so plugins no longer disappear without explanation.
- runner.py wraps the fetch loop in try/finally; summary.csv and index.md are flushed even when fetch raises mid-iteration, so partial runs stay inspectable. The original exception is re-raised afterwards.
- secrets.py escapes backslashes and double quotes when writing values to .streamlit/secrets.toml, so a value containing a quote no longer produces a malformed TOML file that breaks st.secrets on next start.
Verified with a mini test harness: TOML round-trips a value like 'a"b\\c' through tomllib, the runner produces summary.csv after a forced mid-loop crash, and the registry warns on a NameError while staying silent on a missing optional import.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
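The escaping fix in miniature (helper name hypothetical); this is the round-trip the test harness checks:

    import tomllib

    def _toml_escape(value: str) -> str:
        # Backslashes first, then double quotes (the order matters)
        return value.replace("\\", "\\\\").replace('"', '\\"')

    raw = 'a"b\\c'
    line = 'YOUTUBE_API_KEY = "' + _toml_escape(raw) + '"'
    assert tomllib.loads(line)["YOUTUBE_API_KEY"] == raw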
…der auth
- Routing per input kind: hashtags + accounts go to one Apify call with
the user-chosen resultsType ('posts' by default). Explicit post/reel
URLs go to a second call with resultsType='details', since 'posts' on
a single-post URL returns nothing useful. The runner sees this as one
fetch generator yielding all results combined.
- _normalize_account refuses URLs whose first path segment is /p/, /reel/,
/explore/, etc. — those used to silently turn into a request for a
username like 'p', returning empty data with no clear error. Also
validates username characters against Instagram's allowed set.
- resolve() raises a PluginError if a value in the post_url field doesn't
look like /p/ or /reel/, so users catch the mistake before paying for
a useless Apify run.
- ApifyClient sends the token in the Authorization: Bearer header instead
of as a ?token= query string, so it doesn't leak into nginx access logs.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
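The account normalization guard, roughly; the real code raises PluginError rather than ValueError, and the reserved-path set plus the 30-character username alphabet are Instagram's usual rules, assumed here:

    import re
    from urllib.parse import urlparse

    _USERNAME_RE = re.compile(r"^[A-Za-z0-9._]{1,30}$")
    _NON_ACCOUNT_PATHS = {"p", "reel", "reels", "explore", "stories", "tv"}

    def _normalize_account(value: str) -> str:
        value = value.strip().lstrip("@")
        if value.startswith(("http://", "https://")):
            parts = [p for p in urlparse(value).path.split("/") if p]
            if not parts or parts[0].lower() in _NON_ACCOUNT_PATHS:
                raise ValueError(f"not an account URL: {value!r}")
            value = parts[0]
        if not _USERNAME_RE.match(value):
            raise ValueError(f"invalid Instagram username: {value!r}")
        return value.lower()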
- youtube_parser/main.py is now a translation layer over content_parser.cli:
it parses the original argument set ('--query', '--video', '--max-comments',
'--include-replies', '--no-transcripts', etc.) and rewrites it into the new
'--source youtube --set key=value' form. Removes ~150 lines of duplicated
CLI logic that drifted away from the new output layout.
- ui/app.py _render_field now handles a 'select' widget with no options
and no default by falling back to a free-text input, so a misconfigured
FieldSpec doesn't crash the whole UI.
- .gitignore picks up .content_parser/ (saved-secrets dir) and
.pytest_cache/.
- tests/ adds 34 unittest cases (no extra dependency, runs with stdlib):
TOML upsert/escape/round-trip, runner partial-run safety, Instagram
account validation + per-kind dispatch + Apify Bearer auth, Apify
adapter field mapping, legacy CLI flag translation. Runs via
'python -m unittest discover -s tests'.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
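The translation idea in miniature; the --set keys used here (max_comments, include_replies, fetch_transcripts) are illustrative stand-ins for whatever the YouTube plugin actually names its settings:

    def translate(legacy_args) -> list[str]:
        # legacy_args is the argparse namespace from the original youtube_parser CLI
        argv = ["run", "--source", "youtube"]
        if legacy_args.query:
            argv += ["--query", legacy_args.query]
        if legacy_args.video:
            argv += ["--video", legacy_args.video]
        argv += ["--set", f"max_comments={legacy_args.max_comments}",
                 "--set", f"include_replies={str(legacy_args.include_replies).lower()}",
                 "--set", f"fetch_transcripts={str(not legacy_args.no_transcripts).lower()}"]
        return argv  # handed to content_parser.cli's entry point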
Four input kinds:
- subreddit (name or URL, with or without 'r/' prefix)
- query (full-text search across all of Reddit)
- post_url (specific thread for comment analysis)
- user (posts by a given Redditor — competitor tracking)
Settings cover the listing knobs (hot/top/new/rising/controversial), time_filter for top/controversial, max posts per input, comment collection (top-level only by default, mirroring the YouTube plugin), and an opt-in expand_more_comments flag for users who want the full tree at the cost of slower scrapes.
The adapter maps PRAW Submission/Comment objects into the unified Item schema (see the sketch below): score / upvote_ratio / num_comments / the NSFW, locked, and spoiler flags / external link domain go into media; awards and post_hint go into extra. Deleted authors render as "[deleted]" rather than None. Comments are flattened with parent_id linkage so the same Markdown renderer that handles YouTube replies works unchanged.
Secrets needed: REDDIT_CLIENT_ID + REDDIT_CLIENT_SECRET (free, created at reddit.com/prefs/apps as a "script" app). REDDIT_USER_AGENT is optional with a sensible default.
Adds 41 new tests (75 total) covering adapter field mapping, input normalization (subreddit/user prefixes, URL parsing), reject paths (invalid chars, listing URL in post_url field), comment depth + cap behavior, and PRAW listing dispatch via mocks. praw>=7.7 added to requirements.txt.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
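Roughly the mapping described above; the PRAW attribute names are the library's own, while the media key names on the Item side are placeholders for the real schema fields:

    def submission_media(sub) -> dict:
        """Map a praw.models.Submission onto the Item's media dict (key names illustrative)."""
        return {
            "score": sub.score,
            "upvote_ratio": sub.upvote_ratio,
            "comments_count": sub.num_comments,
            "nsfw": sub.over_18,
            "locked": sub.locked,
            "spoiler": sub.spoiler,
            "link_domain": None if sub.is_self else sub.domain,
        }

    def author_name(obj) -> str:
        return "[deleted]" if obj.author is None else str(obj.author)  # PRAW returns None for deleted accounts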
- _file_stem now passes source and item_id through _safe_filename, not just title. Defense in depth against an upstream API returning a malicious id like '../../etc/passwd' that would have escaped the output directory. Verified by tests that hit write_item_json / write_item_markdown with traversal attempts and assert the resulting path stays under out_dir.
- _is_reddit_post_url now matches the host exactly (== 'reddit.com' or endswith '.reddit.com', same for redd.it). The previous substring check let 'evilreddit.com' and 'reddit.com.evil.example' through. Tests added for the lookalike rejection plus a positive case for legitimate subdomains like old.reddit.com.
- build_reddit logs a WARNING when REDDIT_USER_AGENT is unset, before falling back to a generic default. Reddit's API rules ask for a username-bearing UA; the warning surfaces the misconfiguration that would otherwise just look like flaky rate limits.
- Reddit fetch errors now go through _redact_spec, which strips query strings and caps length to 80 chars. Prevents accidentally pasting a URL with ?token=... into the field and seeing it echoed back through exception messages and Streamlit logs.
- README.md adds a 'Sharing scraped results' section warning that comments are written to Markdown unescaped — fine for personal viewing, but raw output/ should not be republished without a sanitizer because of Markdown link injection vectors.
- 19 new tests (94 total): _safe_filename behavior, _file_stem path traversal, write_item_* containment, _is_reddit_post_url lookalike rejection + subdomain acceptance, _redact_spec behavior, and build_reddit's logging assertion via patch.dict on sys.modules.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
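The exact-match host check, roughly (later factored out as the shared _is_reddit_host helper; the real _is_reddit_post_url also inspects the path shape, omitted here):

    from urllib.parse import urlparse

    def _is_reddit_host(host: str) -> bool:
        host = (host or "").lower().rstrip(".")
        return (host in ("reddit.com", "redd.it")
                or host.endswith(".reddit.com")
                or host.endswith(".redd.it"))

    def _is_reddit_post_url(url: str) -> bool:
        parsed = urlparse(url)
        return parsed.scheme in ("http", "https") and _is_reddit_host(parsed.hostname or "")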
…ck, host symmetry
Should-fix items from the second review pass:
- _file_stem now appends a short sha256 prefix when item_id sanitizes to
the fallback ('item'), so two items whose ids both reduce to special
chars no longer clobber each other on disk.
- _redact_spec also strips URL fragments (#access_token=...) in addition
to query strings, since OAuth implicit-flow tokens travel there.
- build_reddit now treats whitespace-only REDDIT_USER_AGENT as missing
and falls back to the default with the WARNING log, instead of
silently passing whitespace through to PRAW.
- _normalize_subreddit and _normalize_user reject non-Reddit hosts when
given a URL, mirroring _is_reddit_post_url. Cosmetic — PRAW would
still hit api.reddit.com — but keeps validation symmetric.
Nice-to-haves while we're here:
- replace_more on expand_more=True is now hard-capped at 32 expansions
(constant _MAX_REPLACE_MORE) instead of unbounded. Unbounded calls
could pull thousands of comments and minutes of latency on big threads.
- 'rising' listing on a user (PRAW doesn't expose it) falls back to
'new' with an INFO log so the user sees why the result differs.
- _is_reddit_host extracted as a shared helper used by all three URL
validators.
8 new tests (102 total) cover stem collision avoidance, fragment
redaction, whitespace UA fallback, non-reddit host rejection in both
normalizers, replace_more cap, and the rising→new log.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Three input kinds:
- query: groups.search → wall.get for each found community
- community: screen_name / club<id> / numeric / vk.com URL
- post_url: vk.com/wall<owner>_<post>
Settings cover the whole pipeline: max communities per query, max posts per wall (capped at VK's 100/call), fetch_comments toggle, max comments per post (paginated via wall.getComments offsets), and comment_depth top_level vs all (with thread_items_count=10 when 'all').
The adapter resolves author names via the profiles + groups arrays returned by extended=1 calls — no extra users.get / groups.getById roundtrips. Negative owner_ids correctly map to club<id>; positive ones to id<id>.
Security carry-overs from the previous reviews:
- VKClient sends access_token in the POST body, never query string.
- VK error_code 5/17/27/28 → AuthError; 6/9/29 → RateLimitError; rest → PluginError. UI surfaces these distinctly.
- _normalize_community and _extract_wall_id reject non-VK hosts (vk.com, vk.ru, m.vk.com only — substring match would let evilvk.com through).
- _normalize_community rejects VK reserved paths (feed, im, video, etc.) that would otherwise look like screen names but aren't communities.
- _redact_spec strips ?query and #fragment before logging.
47 new tests (149 total): adapter field mapping for posts/comments and user vs group label resolution, normalization (screen_name / club / URL / lookalike host / reserved path), wall ID extraction, _redact_spec, client error code mapping, token-not-in-URL invariant, and fetch dispatch for query/community/post including dedupe across mixed inputs.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
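The error-code split described above, schematically; the exception classes come from the core errors module, whose import path is assumed here:

    from content_parser.core.errors import AuthError, PluginError, RateLimitError  # assumed path

    AUTH_CODES = {5, 17, 27, 28}
    RATE_LIMIT_CODES = {6, 9, 29}

    def _raise_for_vk_error(payload: dict) -> None:
        err = payload.get("error")
        if not err:
            return
        code, msg = err.get("error_code"), err.get("error_msg", "")
        if code in AUTH_CODES:
            raise AuthError(f"VK auth error {code}: {msg}")
        if code in RATE_LIMIT_CODES:
            raise RateLimitError(f"VK rate limit {code}: {msg}")
        raise PluginError(f"VK API error {code}: {msg}")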
…apter
Should-fix items from the combined review:
- _fetch_comments now checks the cap *before* every append (top-level AND reply), so depth=all on a thread with hundreds of replies no longer overshoots max_comments by one. Also short-circuits pagination using the response's `count` field instead of doing one extra round-trip just to see an empty page.
- VKClient retries RateLimitError (codes 6/9/29) with exponential backoff (1s, 2s, 4s, ... up to max_rate_limit_retries=3 by default) before bubbling up. AuthError and other PluginErrors are not retried. _sleep is a static method so tests can patch it without timing flakes.
- VKClient now uses a single requests.Session for the whole client lifetime, so we don't pay the TLS handshake on every API call.
- post_to_item raises ValueError when owner_id or id is missing, instead of silently constructing item_id="0_0" which would collide across multiple malformed posts.
- _collect_for_spec post-path no longer duplicates the group/profile-cache lookup that the adapter already does via _label_for_id; just appends (post, None) and lets the adapter resolve. Extracted the shared response-merging logic into _extract_extended.
9 new tests (158 total): retry-then-succeed, give-up-after-max-retries, auth-not-retried, top-level cap exact, depth=all overflow control, single-page short-circuit on count, multi-page pagination continues when count > page, adapter ValueError on missing fields. The earlier ClientErrorMappingTest cases were updated to patch requests.Session (not requests.post) since the client now uses a session.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
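The retry loop described above, schematically; method names other than _sleep are placeholders, and the real client's signatures may differ:

    import time
    from content_parser.core.errors import RateLimitError  # assumed path

    class VKClient:
        max_rate_limit_retries = 3

        def _post(self, method: str, params: dict):
            raise NotImplementedError  # real client POSTs to api.vk.com with the token in the body

        @staticmethod
        def _sleep(seconds: float) -> None:       # static so tests can patch it without timing flakes
            time.sleep(seconds)

        def call(self, method: str, **params):
            delay = 1.0
            for attempt in range(self.max_rate_limit_retries + 1):
                try:
                    return self._post(method, params)  # raises RateLimitError on codes 6/9/29
                except RateLimitError:
                    if attempt == self.max_rate_limit_retries:
                        raise
                    self._sleep(delay)
                    delay *= 2                         # 1s, 2s, 4s, ...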
Two input kinds:
- channel: @username, plain username, or t.me URL (parses recent messages)
- post_url: t.me/<channel>/<msg_id> for a specific post + its comments
Reuses APIFY_API_TOKEN from the Instagram plugin so a Streamlit Cloud
user only configures the Apify secret once. The default actor is
apify/telegram-channel-scraper but actor_id is exposed as a setting
so it can be swapped (e.g. 73code/telegram-scraper) without code edits.
The adapter is field-shape-defensive because different Telegram
scrapers on Apify use different key names: _pick walks a list of
likely keys, _reactions_total accepts a list of {emoji, count} dicts,
a flat {emoji: count} mapping, or just an int. Comments embedded in
the message dict (replies_data, comments, discussion, thread.items)
all parse to the same Comment list.
Security carry-overs from the prior reviews:
- _is_tg_host does exact-match on t.me / telegram.me to reject
evilt.me and t.me.evil.example
- _normalize_channel rejects Telegram reserved paths (joinchat, proxy,
iv, etc.) that would otherwise look like usernames
- _extract_post_url rejects /c/<chatid>/ private-channel paths since
the public scrapers cannot read them
- _redact_spec strips ?query and #fragment before logging
- post-fetch comment count is capped to max_comments_per_post even
when the actor returns more
49 new tests (207 total): _pick fallback chain, _reactions_total over
all three reaction shapes, message_to_item with primary and alt field
names, zero-views preserved, inline-comment extraction, alternative
field-name fallbacks, host validation lookalike rejection, reserved
path rejection, /c/ private path rejection, dispatch one-actor-call vs
two for mixed inputs, actor_id override, dedupe across channel+post,
and comment cap enforcement.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
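The shape-defensive helpers described above, roughly (behavior inferred from this commit; edge cases in the real code may differ):

    def _pick(msg: dict, *keys, default=None):
        """Return the first present, non-None value among several likely key names."""
        for key in keys:
            if key in msg and msg[key] is not None:
                return msg[key]
        return default

    def _reactions_total(value) -> int | None:
        if value is None:
            return None
        if isinstance(value, int):
            return value                                                  # actor already summed it
        if isinstance(value, dict):
            return sum(v for v in value.values() if isinstance(v, int))  # flat {emoji: count}
        if isinstance(value, list):
            return sum(int(d.get("count", 0)) for d in value if isinstance(d, dict))  # [{emoji, count}, ...]
        return None

    # usage: views = _pick(msg, "views", "viewsCount", "view_count")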
…nt, reply tree
Should-fix items from the combined review:
- actor_id is validated against ^[A-Za-z0-9_-]+[/~][A-Za-z0-9_.-]+$.
Whitespace-only or unset falls back to the default actor cleanly;
garbage like 'noslash' or '/missing' raises PluginError up-front
instead of being sent to Apify and producing a confusing 404.
- fetch() does ONE pass through actor results: parses each message
to Item, dedupes by item_id in the same loop. The previous version
parsed each message twice (once for dedupe key, once for yield),
doubling the adapter cost on big result sets.
- _replies_count handles both shapes: 'replies: 42' (number) and
'replies: [...]' (list of comment dicts → use len). Previously
number-only responses left media.comments_count as None.
- _extract_comments now also looks at the bare 'replies' field for
comment lists (not just replies_data/comments/discussion/thread).
- Reply tree linkage: when a comment has reply_to_message_id (or
replyToMessageId / reply_to_msg_id) and the parent is in the same
fetched batch, we set parent_id accordingly so the Markdown writer
can render the thread structure. Out-of-batch references stay top-level.
- _is_private_channel_url helper catches t.me/c/<chat_id>/... before
_extract_post_url returns None, raising an explicit PluginError that
tells the user the URL is private and Apify scrapers can't read it.
- _to_int defensively coerces numeric values, refusing to silently
store a stray dict (e.g. {'count': 100}) in media when an actor
uses an unexpected schema. Applied to views/forwards counts.
- Cosmetic: media_obj computed once instead of msg.get('media') twice.
25 new tests (232 total): _to_int across all input shapes including
the dict-leak guard, _replies_count for int/list/alt-keys,
reply_to_message_id parent linkage with both inside and outside-batch
references, dict-views does-not-leak, actor_id validation across
five garbage forms plus default fallback for empty/whitespace,
ApifyError → PluginError wrapping for both channels and posts paths,
private /c/ URL explicit error.
Also fixes a regression introduced in the previous edit pass where
_channel_label lost its def line and became a continuation of
_replies_count's body — caught by the test suite immediately.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
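Two of the guards above, schematically (PluginError import path assumed; the regex and default actor are the ones named in this commit):

    import re
    from content_parser.core.errors import PluginError  # assumed path

    _ACTOR_ID_RE = re.compile(r"^[A-Za-z0-9_-]+[/~][A-Za-z0-9_.-]+$")
    _DEFAULT_ACTOR = "apify/telegram-channel-scraper"

    def _validated_actor_id(value: str | None) -> str:
        if value is None or not value.strip():
            return _DEFAULT_ACTOR                      # unset / whitespace-only falls back cleanly
        value = value.strip()
        if not _ACTOR_ID_RE.match(value):
            raise PluginError(f"actor_id does not look like <user>/<actor>: {value!r}")
        return value

    def _to_int(value):
        # Coerce numeric values defensively; refuse to store a stray dict like {'count': 100} in media
        if isinstance(value, bool):
            return None
        if isinstance(value, int):
            return value
        if isinstance(value, str) and value.isdigit():
            return int(value)
        return None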
Pull values from a column or range of any Google Sheets spreadsheet and
drop them into the active plugin's input tab. Same loader code will be
called from the cron runner in the next step, so it's designed for both
interactive and headless use.
Auth uses a Google Cloud service account: paste the JSON key into the
GOOGLE_SHEETS_CREDENTIALS secret, then share each target spreadsheet
with the service account's email (visible via .service_account_email()
helper for UX hints).
Loader API (content_parser/loaders/gsheets.py):
loader = GoogleSheetsLoader.from_secrets({"GOOGLE_SHEETS_CREDENTIALS": ...})
loaded = loader.load(sheet_id_or_url, tab="Communities", range_a1="A:A",
skip_header=False)
loaded.values # ['durov_says', 'telegram', ...] — flattened, deduped, trimmed
loaded.sheet_title / loaded.tab_title / loaded.count
Sidebar block "📥 Загрузить из Google Sheets" ("Load from Google Sheets")
exposes the same loader under any plugin: paste creds, paste the sheet URL,
pick tab + range, pick which input kind (channel / community / hashtag /
etc.) to populate, hit Загрузить (Load). Loaded values append to the
existing input field (preserving manual entries), so several sheets can be
merged before running.
Defensive behavior:
- credentials JSON is validated for type/client_email/private_key keys
before sending to gspread, with a clear AuthError if it's e.g. an
OAuth client JSON instead of a service account key.
- Sheet URL extraction tolerates the ID alone, the full /d/<id>/edit URL,
and trailing query params.
- A1 range validated against a permissive regex; an actual range error
from the API surfaces with the user's range echoed back.
- 403 from Google → AuthError with "share the sheet" hint. 404 →
PluginError with "check the URL/ID".
- Unknown tab name → PluginError listing the tab names that DO exist.
20 new tests (252 total): credentials validation across all four
malformed forms (non-JSON string, JSON-but-not-dict, missing field,
missing secret), sheet ID extraction (bare ID / full URL / URL with
query / garbage / empty), load() with single column / multi column /
deduplication / blank-skipping / skip_header / invalid range / unknown
tab / 403 / 404 / default-first-sheet.
requirements.txt: +gspread>=6.0, +google-auth>=2.20 (the latter was
already a transitive dep of google-api-python-client; pinning it
explicitly makes the loader self-contained for cron use later).
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Should-fix items:
- _extract_sheet_id now validates the URL host strictly (must be docs.google.com). The previous regex.search would happily pull '/d/<id>/' out of any URL, including https://evil.com/.../d/<id>/... Not an SSRF (we don't fetch the user URL — the ID just becomes a parameter to the Google Sheets API), but the silent acceptance was misleading. Lookalike hosts and other Google subdomains (mail.google.com etc.) are now rejected explicitly.
- validate_credentials extracted as a static method that does the shape check WITHOUT building a gspread client. The save button now validates pasted JSON via this helper before persisting, so users see "JSON невалиден: …" ("JSON is invalid: …") immediately instead of saving garbage that fails on next load.
- The service-account 'type' field is now checked too: an OAuth client JSON (type=authorized_user) is rejected with a message that points the user to the right kind of credential.
- All UI buttons in this block translated to Russian (Сохранить / Удалить / ✏️ Заменить / ✕ Отмена, i.e. Save / Delete / Replace / Cancel) — was an English-Russian mix.
Nice-to-haves while we're here:
- After creds are saved, the field collapses to a one-line summary: "✓ Учётка сохранена: bot@project.iam.gserviceaccount.com" ("credentials saved") with a hint to share the spreadsheet with that email — addresses both the "where do I find this?" UX gap and the security concern of re-rendering the full RSA private key in plain text on every load. An ✏️ Заменить (Replace) button reveals the textarea again.
- A warning caption above the JSON field reminds the user that the JSON contains a private key.
- st.spinner around the load call so the UI shows progress feedback.
- An empty / whitespace 'tab' parameter falls back to the first sheet (matters for cron configs that may pass tab="").
- raw_rows dropped from LoadedRange — it was populated but never read, carrying unnecessary copies of full sheet data in memory.
8 new tests (260 total): non-Google host rejected (with an explicit docs.google.com hint in the error), lookalike host rejected, other-Google subdomain rejected (mail.google.com), OAuth client JSON rejected, validate_credentials does NOT call _build_client, empty/whitespace tab fallback to first sheet, raw_rows attribute removed from LoadedRange.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
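The strict extraction described above, roughly (PluginError import path and the bare-ID length heuristic are assumptions):

    import re
    from urllib.parse import urlparse
    from content_parser.core.errors import PluginError  # assumed path

    _BARE_ID_RE = re.compile(r"^[A-Za-z0-9_-]{20,}$")

    def _extract_sheet_id(value: str) -> str:
        value = value.strip()
        if not value.startswith(("http://", "https://")):
            if _BARE_ID_RE.match(value):
                return value
            raise PluginError(f"does not look like a Sheets ID or URL: {value!r}")
        parsed = urlparse(value)
        if (parsed.hostname or "").lower() != "docs.google.com":
            raise PluginError("Google Sheets URLs must point at docs.google.com")
        match = re.search(r"/d/([A-Za-z0-9_-]+)", parsed.path)
        if not match:
            raise PluginError("could not find /d/<id>/ in the URL")
        return match.group(1)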
Job describes one scheduled run: source plugin, inputs (inline list and/or
Google Sheets references), settings, optional cron schedule. Job-files
live in ~/.content_parser/jobs/<name>.yaml and are read with yaml.safe_load
to keep the door closed on !!python/object construction tricks.
Schema (jobs/schema.py):
- Job dataclass with validate(): rejects bad names (regex
^[A-Za-z0-9_-]{1,64}$), missing source, invalid cron expressions, jobs
without any inputs, malformed sheet_inputs, unknown notify_on_failure.
- SheetInput dataclass mirrors GoogleSheetsLoader.load() args.
- is_valid_cron loosely accepts standard 5-token expressions and @-aliases
(@daily, @Weekly, ...). It refuses garbage like 'rm -rf /' that contains
characters outside [\d*/,-A-Za-z].
- resolved_output_dir() returns output/scheduled/<name>/<timestamp>/ by
default, an absolute output_dir as-is, or a relative one resolved
against cwd. The timestamp suffix is always appended.
Store (jobs/store.py):
- list_jobs() / load_job() / save_job() / delete_job() / job_exists().
- Path resolution validates the candidate is inside JOBS_DIR via
Path.resolve() + relative_to() — defense in depth even though the
job-name regex already keeps slashes out.
- list_invalid() returns (name, error) pairs for files that fail to
parse, so the UI can surface broken jobs instead of silently dropping.
- save_job sets chmod 600 (best-effort).
Runner (jobs/runner.py):
- run_job(name) loads the YAML and runs run_job_obj(job).
- _resolve_inputs merges inline values with Sheets-loaded values per
input kind, then dedupes preserving insertion order, then drops empty
kinds.
- _collect_secrets pulls plugin secret_keys + GOOGLE_SHEETS_CREDENTIALS
if any sheet_inputs present + the same WEBSHARE_/PROXY_ optional set
the CLI/UI uses.
- On success: writes .last_run.txt marker. On failure: writes
last_error.txt with traceback unless notify_on_failure='none'. The
original exception is re-raised so cron sees a non-zero exit.
48 new tests (308 total): cron expression validation across standard
and alias forms (and rejection of cmd-injection-shaped garbage), Job
validation across every guard (bad name / no source / no inputs /
invalid schedule / malformed sheet ref / unknown notify), YAML
round-trip + safe_load enforcement (rejects !!python/object), name_hint
fallback when YAML omits 'name', range vs range_a1 alt key,
resolved_output_dir for default/relative/absolute, store CRUD with
path-traversal rejection, list_jobs sorting + skip-invalid behavior,
chmod 600 on save, runner input merge with dedupe, secret collection
(plugin / sheets-needed / optional proxy), success/failure marker
writing, notify_on_failure=none suppresses error file, empty resolved
inputs raise PluginError before plugin is touched.
requirements.txt: +pyyaml>=6.0.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
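What a job file might look like, expressed as the dict yaml.safe_dump would serialize; every key name here is illustrative, and the authoritative schema is jobs/schema.py:

    import yaml

    job = {
        "name": "weekly_durov",
        "source": "telegram",
        "schedule": "0 6 * * MON",            # optional; omit for manual-only jobs
        "inputs": {"channel": ["durov_says"]},
        "sheet_inputs": [
            {"sheet": "<sheet-id-or-url>", "tab": "Communities",
             "range_a1": "A:A", "kind": "channel"},
        ],
        "settings": {"max_comments_per_post": 50},
        "notify_on_failure": "none",          # suppresses last_error.txt on failure
    }
    text = yaml.safe_dump(job, sort_keys=False, allow_unicode=True)
    # would be written to ~/.content_parser/jobs/weekly_durov.yaml and read back with yaml.safe_load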
cron.py manages a marker-bounded block in the user's crontab without ever touching lines outside our markers:

    # >>> content_parser jobs >>>
    0 6 * * MON cd /repo && python -m content_parser.cli jobs run weekly # job:weekly
    # <<< content_parser jobs <<<

API:
- install_cron(jobs=None, project_root=None, python_executable=None, log_path=None) — collects every job with a schedule, regenerates the managed block. Idempotent: running twice with the same jobs yields the same crontab. Existing user lines outside the markers are preserved.
- remove_cron() — strips the block, returns True/False.
- read_block() — best-effort parse of currently-installed entries (schedule, job_name, command).
Safety:
- shlex.quote on every path/argument that goes into the cron command, so even a hypothetical bad job name (which the schema regex already rejects) couldn't inject extra shell metacharacters.
- Friendly errors for missing crontab binary and 'no crontab' state.
CLI subcommand `jobs`:
- jobs list → tabulated overview of all saved jobs + invalid files
- jobs show <name> → dump a job's canonical YAML
- jobs run <name> → invoke run_job() with stdout logging and progress
- jobs install-cron → regenerate the managed block
- jobs remove-cron → strip the managed block
- jobs cron-status → show what's currently in the block
18 new tests (326 total): _strip_block leaves outside lines untouched and handles block-at-start, _build_block produces marker-wrapped lines with # job:<name> footer, build_command_for_job shell-quotes paths with spaces and uses safe paths as-is, install_cron idempotency across runs, jobs without schedule are skipped, lines outside markers preserved through reinstall, remove_cron only writes when block exists, read_block parses entries back, _existing_crontab returns "" for the "no crontab for user" case but raises on real errors and on missing binary.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
New '🕐 Расписание' (Schedule) section appears at the bottom of every plugin's page. Lets the user:
1. List existing jobs with collapsed details: source, schedule (or a "ручной запуск" / manual-run badge), description, inline inputs and Sheet refs summary.
2. Act on each job: ▶️ Запустить (Run; calls run_job_obj with a live log), ✏️ Изменить (Edit; raw YAML editor with Save/Cancel), 🗑️ Удалить (Delete).
3. ➕ Создать job из текущего состояния (Create a job from the current state): captures the current input tabs + plugin settings into a new YAML file. Bare-minimum form: name, optional cron, optional description; sheet_inputs are added by editing the YAML afterward (since they need URL/tab/range fields).
4. 📅 Cron section, automatically grayed out on hosts without a `crontab` binary (Streamlit Cloud), where it shows a copy-paste GitHub Actions workflow as the alternative path. On hosts with crontab: install / remove buttons + a summary of the currently-installed entries.
The UI gracefully surfaces invalid YAML files via list_invalid(), so a user who hand-edited a file and broke it sees the parse error instead of having the job silently disappear.
is_cron_available() helper added to jobs/cron.py: it runs a one-shot `crontab -l` and catches FileNotFoundError. The UI calls it once per render to decide whether to show the install/remove buttons or the GH Actions template.
The run button label is now "▶️ Запустить (разово)" ("Run once") to disambiguate it from the per-job ▶️ buttons in the Schedule panel.
…friendly CLI
Should-fix items:
- The inputs YAML parser now refuses non-list values per kind. The previous comprehension iterated strings character-by-character, so the typo 'community: durov_says' (no brackets) silently produced ['d','u','r',...]. The fix raises a clear PluginError before the typo can corrupt a run (see the sketch below).
- Job.validate() now rejects '..' anywhere in output_dir parts. Absolute paths still go through (the user explicitly opts in), but the path-traversal cases '../../etc' and 'custom/../escape' are caught at validation.
- build_command_for_job rejects newlines and carriage returns in any path fragment (project_root, python_executable, log_path) and in job.name. shlex.quote happily preserves a literal \n inside its single-quoted output, which would split a crontab entry across two lines and corrupt the file. The schema's job-name regex already covers job.name, but the defense is added there too for future-proofing.
- cli jobs run wraps run_job in try/except for AuthError, PluginError and KeyError (unknown source from get_plugin), printing a friendly stderr message and returning exit code 1 instead of dumping a Python traceback.
- run_job_obj now computes resolved_output_dir() ONCE up-front. Earlier, a Sheets-load failure or empty resolved inputs would call job.resolved_output_dir() twice — once for the eventual run, again to pick a place for last_error.txt — producing two timestamped directories that differ by milliseconds. Now both markers land in the same dir.
12 new tests (338 total): output_dir rejected with .. at start / middle, absolute and normal-relative output_dirs accepted, string / int / dict values in inputs raise on YAML load (with the "must be a list" hint), an empty input value treated as an empty list, newline rejected in project_root / python_executable / log_path, carriage return rejected. The empty-resolved-inputs test in test_jobs_runner already verified the single-out_dir behavior end-to-end (passes with the refactor).
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
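The list-only guard from the first bullet, roughly (helper name and PluginError import path assumed):

    from content_parser.core.errors import PluginError  # assumed path

    def _validate_inputs(inputs: dict | None) -> dict:
        cleaned = {}
        for kind, values in (inputs or {}).items():
            if values in (None, ""):
                cleaned[kind] = []                       # empty value is treated as an empty list
            elif isinstance(values, list):
                cleaned[kind] = [str(v) for v in values]
            else:
                # 'community: durov_says' (no brackets) would otherwise be iterated char-by-char
                raise PluginError(f"inputs.{kind} must be a list, got {type(values).__name__}")
        return cleaned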
New content_parser/transcription/ module wires yt-dlp + OpenAI Whisper into the existing Item.transcript field. When the user enables 'transcribe_videos' in a plugin's settings, each Item with a video URL is downloaded as audio (MP3 64 kbps, well under the 25 MB Whisper API limit), shipped to api.openai.com/v1/audio/transcriptions, and the verbose_json segments are mapped onto the existing Transcript schema so the Markdown writer renders them the same way as YouTube subtitles.
Module layout:
- downloader.py — yt-dlp wrapper with the FFmpegExtractAudio postprocessor and a 24 MB filesize cap. get_duration_seconds() probes without downloading for budget gating.
- whisper_api.py — minimal Bearer-auth HTTP client (just `requests`, no `openai` package). Distinguishes 401 (bad key), 429 (rate limit), and other 4xx/5xx with the API's error message (see the sketch below).
- cache.py — ~/.content_parser/transcription_cache/<source>_<id>.json, so re-running a job doesn't re-pay for previously transcribed items.
- runner.py — maybe_transcribe(item, settings, secrets, only_if_missing=) is the single entry point plugins call. Order: cache check → duration cap → download → API → cache write.
Plugin integration:
- Instagram, VK, Telegram add `transcribe_videos` (bool, default off) and `max_audio_seconds_per_video` (default 600) FieldSpecs and call maybe_transcribe inline in fetch().
- YouTube treats Whisper as a fallback: only_if_missing=True means it runs only when youtube-transcript-api couldn't return segments (subs disabled, blocked, etc.). Avoids wasting API spend on videos that already have free subtitles.
UI:
- Sidebar shows an inline 'Параметры Whisper' (Whisper settings) expander when transcribe_videos is checked, with an OPENAI_API_KEY input + save/clear buttons + a caption about the cost and the ffmpeg requirement.
- OPENAI_API_KEY is in the optional shared-secrets list, so a saved value is picked up across plugins and by the cron runner.
Security carry-overs:
- Token in the Authorization: Bearer header, never the URL.
- _video_url_for prefers the canonical post URL (e.g. instagram.com/reel/AAA/) over CDN URLs in media.video_url, since CDN tokens often expire while yt-dlp can re-resolve from the post URL fresh.
- Cache filenames go through the _safe regex so a malicious upstream id like '../../etc' can't escape the cache dir.
- Hard cap on audio duration before download blocks surprise costs.
24 new tests (362 total): cache CRUD with path-traversal sanitization; whisper_api Bearer header / verbose_json format / language passthrough / 401 / 429 / other-error message extraction / valid response parsing; maybe_transcribe disabled-by-setting / no-key-sets-error / cache-hit-skips-network / full-pipeline-downloads-and-caches / duration-cap-blocks / download-failure-recorded / whisper-failure-recorded / only_if_missing-skips-when-present / only_if_missing-runs-when-empty / no-video-url-silent / prefers-canonical-url-over-cdn.
requirements.txt: +yt-dlp>=2024.0. ffmpeg required at runtime (documented in plugin help text).
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
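The Whisper call itself is just a multipart POST with a Bearer header; a trimmed sketch (the real whisper_api.py maps 401/429/5xx to typed errors rather than calling raise_for_status):

    import requests

    def transcribe_audio(path: str, api_key: str, language: str | None = None) -> dict:
        data = {"model": "whisper-1", "response_format": "verbose_json"}
        if language:
            data["language"] = language
        with open(path, "rb") as fh:
            resp = requests.post(
                "https://api.openai.com/v1/audio/transcriptions",
                headers={"Authorization": f"Bearer {api_key}"},   # token never goes in the URL
                files={"file": fh},
                data=data,
                timeout=600,
            )
        resp.raise_for_status()
        return resp.json()   # 'segments' are then mapped onto the existing Transcript schema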
…ion pin
Should-fix items:
- runner.maybe_transcribe now refuses URLs that aren't public HTTP(S): loopback hostnames (localhost / 0.0.0.0), IPv4/IPv6 literals in private RFC1918 ranges, link-local (169.254.0.0/16, incl. AWS metadata), reserved and loopback addresses. yt-dlp would otherwise happily fetch from internal networks if any third-party API (Apify/VK/Telegram actor) ever returned such a URL — chain-of-trust SSRF. Bare DNS names still pass since resolution happens later in yt-dlp; this layer only catches literals (see the sketch below).
- runner.maybe_transcribe blocks transcription when get_duration_seconds() returns None. Without a known length the per-video Whisper bill is unbounded; refusing is the cheap-and-safe default. Earlier code fell through this branch and would download anyway.
- whisper_api.transcribe_audio retries 429 (rate limit) and 5xx (server error) up to max_retries=2 with exponential backoff (2s, 4s). 401 and other 4xx surface immediately. _sleep is a module-level helper so tests patch it without slowing the suite — TranscribeAudioTest's test_429_rate_limit was updated to use max_retries=0 for the no-retry semantic.
Nice-to-haves:
- yt-dlp pinned to >=2024.0,<2027.0 to bound the supply-chain blast radius if a future major version ever ships a malicious extractor.
- The UI caption under Параметры Whisper now mentions that the saved key persists across checkbox toggles — only 🗑️ removes it.
17 new tests (379 total): _is_public_url across normal URLs, http variant, non-http schemes, localhost / 0.0.0.0 / 127.0.0.1 / IPv6 ::1, RFC1918 (10/172.16/192.168), link-local 169.254 (AWS metadata), IPv6 fc00::/7, empty/invalid input, DNS names pass through; runner blocks on private URL before any download; runner blocks on unknown duration; Whisper retry on 429-then-success / 500+503-then-success / exhausted retries; 401 and 400 do NOT retry (single call only). The existing test_429_rate_limit was adjusted for the new retry semantics.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
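The literal-IP guard, roughly; the stdlib ipaddress module does the classification, and bare DNS names pass through as described:

    import ipaddress
    from urllib.parse import urlparse

    def _is_public_url(url: str) -> bool:
        try:
            parsed = urlparse(url or "")
        except ValueError:
            return False
        host = parsed.hostname
        if parsed.scheme not in ("http", "https") or not host:
            return False
        if host.lower() == "localhost":
            return False
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            return True                     # bare DNS name; resolution happens later in yt-dlp
        return not (ip.is_private or ip.is_loopback or ip.is_link_local
                    or ip.is_reserved or ip.is_multicast or ip.is_unspecified)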
Three cross-cutting issues from the full-project review, in one batch:
1. GitHub Actions tests workflow (.github/workflows/tests.yml). Runs on every push and PR-to-main against Python 3.11 and 3.12, installs requirements.txt, runs `python -m unittest discover -s tests -v`, and smoke-checks `cli list-sources`. No more silent regressions between manual reviews.
2. _redact_spec was reimplemented in three plugins (Reddit, VK, Telegram), each with the security-relevant job of stripping ?query and #fragment from URLs before they hit logs or exception messages. When we added fragment-stripping to Reddit, the others were missed for a release. Extracted to content_parser/core/redact.py as redact_spec() (see the sketch below). All three plugins now import the single canonical implementation; tests import from core.redact too (aliased to _redact_spec locally to keep diffs small).
3. ApifyClient lived in plugins/instagram/apify_client.py and Telegram imported from there — a runtime cross-plugin dependency that would silently break if Instagram were renamed or removed. Moved to content_parser/clients/apify.py (a new top-level package for shared HTTP clients). Both Instagram and Telegram now import from the shared module; Instagram's old file is deleted. Tests adjusted to patch the new module path (content_parser.clients.apify.requests.post).
All 379 tests still pass after the moves.
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
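The canonical helper, roughly (length cap per the earlier Reddit commit):

    from urllib.parse import urlsplit, urlunsplit

    def redact_spec(spec: str, max_len: int = 80) -> str:
        """Strip ?query and #fragment from URL-looking specs and cap length before logging."""
        spec = (spec or "").strip()
        if "://" in spec:
            parts = urlsplit(spec)
            spec = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if len(spec) > max_len:
            spec = spec[: max_len - 1] + "…"
        return spec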
New 'instagram_graph' plugin alongside the existing public 'instagram'
(Apify) plugin. Different tool for different jobs:
- 'instagram' — public posts from any account (Apify, $$$)
- 'instagram_graph' — your own posts + insights (Meta Graph API, free)
What it gives that Apify can't:
- Insights — reach, impressions, plays, saved, shares, total_interactions
on your own Reels and posts
- Full comment threads with replies and like counts
- No per-item Apify cost
- Stable, Meta-supported endpoint
Files:
- plugins/instagram_graph/client.py — GraphClient over graph.facebook.com
with retry-on-429/5xx exponential backoff (2s, 4s), pagination via
paging.next URL walking, embedded-token replacement so a 'next' URL
can't smuggle a different token through, error-code mapping
(190/102/etc → AuthError; 10/200/803 → AuthError "permissions";
4/17/32/613 → RateLimitError).
- plugins/instagram_graph/adapter.py — media_to_item maps a Graph
media object (IMAGE/VIDEO/REEL/CAROUSEL_ALBUM) to core.Item with
insights flattened into media dict; flatten_comments folds inline
replies (replies.data) into the flat Comment list with parent_id.
- plugins/instagram_graph/plugin.py — InstagramGraphPlugin with two
inputs: 'account' (Business Account ID, 15-20 digits regex-validated)
and 'post_id' (numeric media ID). Settings: max_posts_per_account,
fetch_comments / fetch_replies / fetch_insights toggles,
max_comments_per_post, plus the standard transcribe_videos /
max_audio_seconds_per_video pair. Whisper integration via the same
maybe_transcribe call as other video plugins.
Plumbing:
- registry.py registers the new plugin alongside the existing five.
- jobs/runner.py adds INSTAGRAM_ACCESS_TOKEN to the optional secrets
list, so cron jobs pick it up automatically.
- ui/app.py shared-secrets list extended too.
Auth requirements (documented in plugin.py docstring):
Convert IG account to Business/Creator → connect to a FB Page →
create Meta Developer App → generate long-lived token via Graph API
Explorer with scopes instagram_basic, instagram_manage_comments,
pages_show_list, business_management → store as INSTAGRAM_ACCESS_TOKEN.
Insights are best-effort: if the /insights call returns a permissions
error (common on archived posts or older media), we swallow it and
continue with the rest of the run instead of dying.
42 new tests (421 total): client (token always overrides embedded ones,
401/code-10/code-4/5xx error mapping, retry-on-429-then-success, retry
exhaustion, pagination across pages, max_items early-exit, embedded-
token override on next URL); adapter (insights envelope flattening
across dict/list shapes, media_to_item field mapping for REEL +
non-REEL, owner_username override, missing-id raises, falls back
gracefully when owner_username not passed, comment_to_core for top
and reply, flatten_comments two-level expansion); plugin (resolve
validates account-id length and post-id format, dedupe across inputs,
fetch dispatch for account-path / post-path / mixed-with-dedupe,
fetch_insights=False skips /insights endpoint entirely, insights
failure does NOT abort the run).
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
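The pagination-with-token-override detail, schematically; the Graph API version string and the helper shape are assumptions, and error handling plus retries are omitted here:

    import requests
    from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

    GRAPH = "https://graph.facebook.com/v19.0"

    def _force_token(url: str, token: str) -> str:
        # paging.next embeds its own access_token; always overwrite it with ours
        parts = urlsplit(url)
        query = [(k, v) for k, v in parse_qsl(parts.query) if k != "access_token"]
        query.append(("access_token", token))
        return urlunsplit(parts._replace(query=urlencode(query)))

    def get_paginated(path: str, token: str, params: dict, max_items: int) -> list[dict]:
        items: list[dict] = []
        payload = requests.get(f"{GRAPH}/{path}",
                               params={**params, "access_token": token}, timeout=30).json()
        while True:
            items.extend(payload.get("data", []))
            next_url = payload.get("paging", {}).get("next")
            if not next_url or len(items) >= max_items:
                return items[:max_items]
            payload = requests.get(_force_token(next_url, token), timeout=30).json()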
Should-fix items:
- GraphClient now scrubs the access token from any RequestException
message before raising. The `requests` library sometimes embeds the
full URL — including ?access_token=… — in connection-error messages,
which would otherwise propagate to last_error.txt / Streamlit logs /
CLI stderr. The exception is re-raised with `from None` so the chained
__cause__ doesn't keep the unredacted original around either.
- is_reel boolean now has explicit parens —
(media_type == "REEL") or (media_type == "VIDEO" and product_type=="REELS")
— instead of relying on Python's `and > or` precedence, which is easy
to misread.
- media_to_item accepts insights as a keyword argument instead of having
callers mutate `media["insights"]`. The plugin now passes the freshly-
fetched insights data through; media dict stays read-only.
- Stale comment about a non-existent _get_url helper replaced with
accurate description of what get_paginated actually does.
4 new tests (425 total): RequestException with the token in its message
gets [REDACTED] in the propagated PluginError; chained __cause__ is None
so the secret doesn't leak through traceback.format_exc; 5xx-then-5xx-
then-success retries with exponential backoff (mirrors the existing 429
test); insights metric selection differs for REEL vs IMAGE media types
(REEL gets plays+total_interactions, IMAGE gets impressions+reach+saved).
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
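The scrub-and-reraise pattern, roughly (helper names illustrative; PluginError import path assumed):

    import requests
    from content_parser.core.errors import PluginError  # assumed path

    def _scrubbed(exc: Exception, token: str) -> str:
        msg = str(exc)
        return msg.replace(token, "[REDACTED]") if token else msg

    def safe_get(url: str, token: str) -> requests.Response:
        try:
            return requests.get(url, params={"access_token": token}, timeout=30)
        except requests.RequestException as exc:
            # 'from None' drops __cause__ so the unredacted message can't resurface
            # via traceback.format_exc() in last_error.txt or Streamlit logs
            raise PluginError("Graph API request failed: " + _scrubbed(exc, token)) from None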
…c cache, Cloud detection, status file
Six findings from the project-wide audit landed in one batch:
1. CSV formula injection guard (core/output.py). Excel/Sheets/LibreOffice
execute any CSV cell starting with =, +, -, @, \t, \r as a formula
(=cmd|'/c calc'!A1 is the canonical RCE proof-of-concept). User-
controlled fields like title and author can come straight from
Apify/Reddit/YouTube comments, which means any of our scrapes could
ship a CSV that runs shell commands when a non-technical viewer opens
it in Excel. _csv_safe prepends a single quote to neutralize the
formula while keeping the value visible. Applied to every string
column going into summary.csv.
2. Token redaction in last_error.txt (jobs/runner.py). The previous
implementation wrote `traceback.format_exc()` raw — and tracebacks
carry the chained exception's message, which can include API URLs
with ?access_token=… in the query (we redact at the source for
Instagram Graph but not in every other plugin's exception path).
_record_failure now scrubs every secret value it knows about (8+
chars only, to skip noise) before the file lands on disk. Both
call sites pass `secrets=secrets` from collect_secrets.
3. YouTube replies cap honored (plugins/youtube/comments.py). When
include_replies=True, fetch_comments used to call _fetch_all_replies
without bound — a single popular top-level comment with 500 replies
would return 1+500 items only for `comments[:max_comments]` to throw
most of them away. The fix threads `remaining = max_comments -
len(comments)` through to _fetch_all_replies, which now stops both
inside the inline loop and at page boundaries. Also requests page
sizes proportional to remaining quota.
4. Atomic transcription cache (transcription/cache.py). put() now
writes to <name>.json.tmp and renames over the final path. POSIX
guarantees rename atomicity, so a crash during the write leaves
either the old value or the new value, never a half-written JSON
that get() catches as ValueError and silently treats as cache miss.
5. Streamlit Cloud detection in secrets layer (core/secrets.py).
.streamlit/secrets.toml is managed by the Cloud dashboard and
read-only at the filesystem level. Detect via STREAMLIT_RUNTIME
env, STREAMLIT_SHARING, or HOSTNAME=streamlit-* and skip the file
write entirely — local config.json (the other write target) still
persists so the value works for the current container; users
mirror it via Settings → Secrets for next deployment.
6. Unified .last_status.json (jobs/runner.py). Both _record_success
and _record_failure now write a single canonical status file that
monitoring / UI can stat once for "is this job healthy?". Schema:
{job, source, status, finished_at, items, error}. Atomic write via
.tmp+replace as well.
17 new tests (442 total): _csv_safe across all five injection-prone
prefixes (= + - @ \t \r) and the safe-string / None / non-string
passthrough cases; an end-to-end summary.csv test that injects a
malicious title and verifies the round-tripped DictReader sees the
quoted form. record_failure-redaction (passes a secret value, expects
[REDACTED] in last_error.txt) and the new status-file shape. Cache
atomicity (no .tmp left after success, second put() replaces first).
Streamlit Cloud detection (STREAMLIT_RUNTIME=cloud → no file written;
empty env → file written normally).
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
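The guard from item 1 above, roughly:

    _FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")

    def _csv_safe(value):
        """Neutralize spreadsheet formula injection; non-strings pass through untouched."""
        if isinstance(value, str) and value.startswith(_FORMULA_PREFIXES):
            return "'" + value
        return value

    # applied to every string column before csv.DictWriter emits summary.csv:
    # writer.writerow({k: _csv_safe(v) for k, v in row.items()})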
Summary
- python -m content_parser.cli {list-sources, run --source ...}
- youtube_parser/ keeps working via re-export shims; the legacy CLI is now a thin translation layer over the new one
What's where
Setup for Streamlit Cloud
Add to Settings → Secrets:
Security highlights (covered by tests)
- _safe_filename applied to source and item_id (defense in depth against malicious upstream IDs); _file_stem appends a short hash when sanitization collapses the id, so collisions can't clobber files
- Authorization: Bearer … header (not query string) so it doesn't leak into nginx logs
- \ and " are escaped when writing secrets.toml, so secrets containing these characters round-trip cleanly
- Exact host matching for Reddit URLs (reddit.com / .redd.it) — evilreddit.com rejected
- _redact_spec strips both query and fragment from URLs before they enter exception messages or logs
- API key inputs use type="password"; ~/.content_parser/config.json and .streamlit/secrets.toml are written chmod 600
- output/ is in .gitignore (contains scraped comments, possibly PII)
Roadmap (not in this PR)
- cli jobs install-cron
Test plan
- pip install -r requirements.txt succeeds
- python -m content_parser.cli list-sources → youtube, instagram, reddit
- python -m unittest discover -s tests → 102 passed
- python -m content_parser.cli run --source youtube --video https://youtu.be/...
- python -m content_parser.cli run --source instagram --account nasa --set max_posts_per_input=5
- python -m content_parser.cli run --source reddit --subreddit python --set listing=top --set time_filter=week
- python -m youtube_parser.main --video URL --max-comments 10 still works
- streamlit run app.py shows source selector with all 3 plugins; tabs render dynamically
https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta